-
Notifications
You must be signed in to change notification settings - Fork 7.9k
Fix bug #65106: PHP fails to compile ext/fileinfo #10422
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
0723fe2
to
e1cf070
Compare
Thanks for the PR!
Looks like the behavior depends on the compiler: https://ci.appveyor.com/project/php/php-src/builds/46012603/job/mutvvhm5rg3jtnie#L1089. Maybe a solution would be to add an option to create_data_file.php (which is not set by default) which creates the new format, and to let users run that script if they need it? |
maybe combine both approaches? |
Yes, I saw AppVeyor's results on Windows :-( For this "two way approach", the ideal would be to have the original, 7 MB, magic.mgc, commited as the authoritative file, So, having an additional data_file_as_string.c (and not a replacement as I first committed) will require to commit either:
I am always reluctant to commiting big and/or generated files, thus 3. would be the most optimal solution, but it is the most complex too. |
Oldies someone? After compiling both versions with all the old gccs and clangs I could find, (and yes I probably killed some polar bears :-( trying by dichotomy to find the minimal necessary memory for each combination. But this is for (unscientific) science… and saving subsequent polar bears by optimizing the millions of compilations of PHP still to occur) Appendices/* data_file_list.c: */
const unsigned char php_magic_database[7015032] = { 0x1C, 0x04, 0x1E, 0xF1, … };
/* data_file_long.c: */
/* /!\ Will require a conversion to unsigned char, works only if array size is multiple of 4, and DOES NOT WORK AS IS ON LITTLE ENDIAN PROCS (that is, all Intel ones): every long has to be reversed. */
const unsigned long php_magic_database[7015032/4] = { 0x1C041EF1, … };
/* data_file_string.c: */
const unsigned char php_magic_database[7015032] = "\x1C\x04\x1E\xF1…";
/* data_file_strings.c: */
const unsigned char php_magic_database[6851][1024] = {
"\x1C\x04\x1E\xF1…", /* 1024 bytes per string */
…, /* 6850 other lines. */
}; for CC in /usr/bin/clang /usr/local/gcc-*/bin/gcc /usr/local/clang-*/bin/clang
do
v="`$CC --version | head -1`"
for f in /tmp/data_file_list.c /tmp/data_file_long.c /tmp/data_file_string.c /tmp/data_file_strings.c
do
t=-
de=0
a=1048576
while [ $a -gt $((de+1024)) ]
do
printf '\r\033[0K%s\t%s\t%s ' "$v" "$de" "$a"
m=$(((de+a)/2))
( ulimit -v $m ; /usr/bin/time -p $CC -c -o /tmp/1.o $f 2> /tmp/1.log >&2 ) \
&& a=$m \
&& t="`sed -e /^real/\!d -e 's#real[^0-9]*##g' < /tmp/1.log`" \
|| de=$m
done
printf '\r\033[0K%d MB\t%s\t%s\t%s\n' $((a/1024)) "$t" "$f" "$v"
done
done On 3 different machines:
/!\ Do not compare between successive versions of a same compiler: some are stock ones, some compiled by myself, some are monolithic and some dynamically linked, and so on. I could even have forgot -O3 to create some of them. … And I found after having run the whole loops for hours, that my tests were unequal, because data_file_list.c had an array of 7015032 bytes, while data_file_string.c had one of 7955032. |
Provide an alternative data_file_string.c to data_string.c, which uses drastically less resources (CPU and RAM) with some compilers (notably GCC and Clang). Use this alternative when one of the aforementioned compilers is detected (see php#10422 (comment)).
Provide a generator for data_file_string.c in ext/fileinfo/Makefile.frag. Ensure data_file_string.c is always synchronized to data_file.c at CI time (see php#10422 (comment)).
e1cf070
to
2715604
Compare
this might increase the sources size substantially, so I would prefer an unified solution did you tried something like |
Yes, I was wondering why in the first place we had this big 44 MB data_file.c in the repository instead of its 7 MB binary source. Do you know the reason?
This is why I thought proposal 3. was the cleanest one.
First form would cause problems with endianness (we would have to provide two files, or one file with macros around each block: that would defeat the goal of reducing its size), |
This comment was marked as spam.
This comment was marked as spam.
I still highly prefer an unified solution and I belive it is possible to achive it. Also, when you look into the source file data, there are many |
This comment was marked as spam.
This comment was marked as spam.
… But afterwards I'm thinking that every installation of PHP since its beginnings (?) uses the flat version, Anyway, I wouldn't pretend modifying the runtime inner workings of ext/fileinfo, I was just here to solve a compile-time problem :-) so it's a bit beyond my goal (and skills). |
…achines Create on-the-fly a data_file_string.c alternative to data_string.c, which uses drastically less resources (CPU and RAM) with some compilers (notably GCC and Clang). Use this alternative when one of the aforementioned compilers is detected (see php#10422 (comment)).
2715604
to
cb44e9c
Compare
OK, now with cb44e9c I think it's much better:
The other good news is that I found mentions to minilua, and I begin to understand how to have a build-time utility compiled from C: |
cb44e9c
to
f8f7fd2
Compare
@mvorisek, would you mind re-running the checks? Github insists on using the forelast version of actions.yml, although I discarded it in the last version :-( |
The data file/var is loaded in |
Yes, you were right, using longs gives a performance boost (see my updated #10422 (comment)). Notes:
So I see it as a (quite acceptable) compromise, |
cb44e9c
to
51b2b00
Compare
I totally agree (I 👍 ed @mvorisek). My only concern (to test now) is on padding: |
cb44e9c
to
97d6bf3
Compare
Woooh, I didn't expected it… but your hybrid "list of strings" solution @mvorisek was even better (for GCC and Clang: less memory, quicker to compile) than the "one pure string" one! |
@weltling wrote:
I feared it too, but with this patch the brave RPi3 finally succeeded into compiling a full PHP 8.2.1 in 90 mn (make -j1 to avoid crashing it again). Yay! |
@outtersg perfect! Did you tested compilation with all exts enabled? What are the next bottlenecks, can we fix more of them? (and possibly reduce the total compile time too) |
No, this is a tuned-for-my-use installation: ./configure \
'--with-zlib' '--with-iconv' '--enable-exif' '--with-gd' '--with-jpeg-dir' \
'--with-readline' '--with-curl' '--enable-fileinfo=shared' '--enable-shared' \
'--enable-mbstring' '--enable-soap' '--enable-sysvsem' '--enable-sysvshm' \
'--with-gettext' '--with-openssl' '--enable-zip' '--enable-sockets' \
'--with-pgsql' '--with-pdo-pgsql' '--without-mysql' '--without-pdo-mysql' \
'--enable-fpm' '--enable-intl' '--enable-phpdbg' '--with-freetype' \
'--enable-calendar' \
'--with-sqlite3=/home/pi/local/sqlite-3.40.1' '--with-pdo-sqlite=/home/pi/local/sqlite-3.40.1'
I could not find any other pathological .c file (C files over 10 MB); this one was really a corner case for multiple compilers. |
97d6bf3
to
adbd704
Compare
OK, now that #10533 has been pushed, correcting the CI for master, @weltling wrote:
… aaaaand wait for master to be CI-stable! OK, now that 18b611d on master made it pass, @mvorisek, we're good to go! |
@outtersg close and reopen PR has worked in the past to re-trigger the CI, maybe worth trying. Thanks |
1df40b9
to
91d42e0
Compare
LGTM, maybe the (original) size should declared in C explicitly, as the NUL appended bytes might be not be expected. @iluuu1994 can you please review? |
Looking at apprentice.c, the only use path of Thus the padding's size as well as its contents has zero incidence on the runtime (… hey, thus I could put an ode to myself in it!), apart from the less-than-1 KB waste. |
Make data_file.c's generator output its array initialization "by list of strings", instead of "by list of chars": that makes the compiler happier. Use strtr() with a precomputed map, instead of a loop over ord(), to generate in 1/100th the time.
Avoid ambiguous 0x tokens (as specified in C standards), by using octal.
Make the (generated) data_file.c's array initialization "by list of strings", instead of "by list of chars": that makes the compiler happier.
In addition to fixing the build with the newly pushed data_file.c (#13369),
|
= On tente d'utiliser les modifs proposées en php/php-src#10422. darcs-hash:b20b797bc93bc04e5940871d382d85a414ec7d97
I put this on my TODO to take a look at. Sorry it took so long to find a reviewer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work!
I get a reduction of about a factor 8 in time and memory on gcc on Linux.
With Clang I get a similar reduction.
On MSVC the time also goes down a lot, and memory usage goes down from 2.8GB to 1.9GB
This fixes #65106:
on memory constrained machines, the compiler gets killed trying to compile ext/fileinfo/data_file.c (via apprentice.c).
Bug analysis
On 2 machines / 3 compilers (Raspbian with stock gcc on a Raspberry Pi 3 Model B, FreeBSD with compiled gcc 11.2 or clang 13.0.0 on an amd64 VirtualBox guest),
the compiler needs ~1 GB RAM compiling the current data_file.c with list initialization:
but only ~256 MB when using string initialization: